Reliable application of machine learning-based decision systems in the wild is one of the major challenges currently investigated by the field. A large portion of established approaches aims to detect erroneous predictions by means of assigning confidence scores. This confidence may be obtained by either quantifying the model's predictive uncertainty, learning explicit scoring functions, or assessing whether the input is in line with the training distribution. Curiously, while these approaches all claim to address the same eventual goal of detecting failures of a classifier upon real-life application, they currently constitute largely separated research fields with individual evaluation protocols, which either exclude a substantial part of relevant methods or ignore large parts of relevant failure sources. In this work, we systematically reveal current pitfalls caused by these inconsistencies and derive requirements for a holistic and realistic evaluation of failure detection. To demonstrate the relevance of this unified perspective, we present a large-scale empirical study enabling, for the first time, the benchmarking of confidence scoring functions with respect to all relevant methods and failure sources. The revelation of a simple softmax response baseline as the overall best performing method underlines the drastic shortcomings of current evaluation amid the abundance of published research on confidence scoring. Code and trained models are available at https://github.com/IML-DKFZ/fd-shifts.
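The softmax response baseline singled out in the abstract is simple enough to state in a few lines. A minimal sketch (function names are hypothetical, not from the paper): the confidence score is the maximum softmax probability, and a prediction is flagged as a potential failure when that score falls below a threshold.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_response(logits):
    """Maximum softmax probability, used as the confidence score:
    low values indicate predictions that are more likely to be wrong."""
    return max(softmax(logits))

def detect_failure(logits, threshold=0.5):
    """Flag a prediction as a potential failure if confidence is low."""
    return softmax_response(logits) < threshold
```

For example, sharply peaked logits such as `[5.0, 0.0, 0.0]` yield a confidence near 0.99 and pass, while uniform logits `[1.0, 1.0, 1.0]` yield 1/3 and are flagged. The threshold is an illustrative choice; in practice it would be tuned on held-out data.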
Artificial Intelligence (AI) is having a tremendous impact across most areas of science. Applications of AI in healthcare have the potential to improve our ability to detect, diagnose, prognose, and intervene on human disease. For AI models to be used clinically, they need to be made safe, reproducible and robust, and the underlying software framework must be aware of the particularities (e.g. geometry, physiology, physics) of medical data being processed. This work introduces MONAI, a freely available, community-supported, and consortium-led PyTorch-based framework for deep learning in healthcare. MONAI extends PyTorch to support medical data, with a particular focus on imaging, and provides purpose-specific AI model architectures, transformations and utilities that streamline the development and deployment of medical AI models. MONAI follows best practices for software development, providing an easy-to-use, robust, well-documented, and well-tested software framework. MONAI preserves the simple, additive, and compositional approach of its underlying PyTorch libraries. MONAI is being used by and receiving contributions from research, clinical and industrial teams from around the world, who are pursuing applications spanning nearly every aspect of healthcare.
Simultaneous localization and categorization of objects in medical images, also referred to as medical object detection, is of high clinical relevance because diagnostic decisions often depend on ratings of objects rather than, e.g., of pixels. For this task, the cumbersome and iterative process of method configuration constitutes a major research bottleneck. Recently, nnU-Net has tackled this challenge for the task of image segmentation with great success. Following nnU-Net's agenda, in this work we systematize and automate the configuration process for medical object detection. The resulting self-configuring method, nnDetection, adapts itself without any manual intervention to arbitrary medical detection problems while achieving results on par with or superior to the state of the art. We demonstrate the effectiveness of nnDetection on two public benchmarks, ADAM and LUNA16, and propose further medical object detection tasks on public data sets for comprehensive method evaluation. Code is at https://github.com/mic-dkfz/nndetection.
The number of international benchmarking competitions is steadily increasing in various fields of machine learning (ML) research and practice. So far, however, little is known about the common practice as well as the bottlenecks faced by the community in tackling the research questions posed. To shed light on the status quo of algorithm development in the specific field of biomedical imaging analysis, we designed an international survey that was issued to all participants of challenges conducted in conjunction with the IEEE ISBI 2021 and MICCAI 2021 conferences (80 competitions in total). The survey covered participants' expertise and working environments, their chosen strategies, as well as algorithm characteristics. A median of 72% of challenge participants took part in the survey. According to our results, knowledge exchange was the primary incentive (70%) for participation, while the reception of prize money played only a minor role (16%). While a median of 80 working hours was spent on method development, a large portion of participants stated that they did not have enough time for method development (32%). 25% perceived the infrastructure to be a bottleneck. Overall, 94% of all solutions were deep learning-based. Of these, 84% were based on standard architectures. 43% of the respondents reported that the data samples (e.g., images) were too large to be processed at once. This was most commonly addressed by patch-based training (69%), downsampling (37%), and solving 3D analysis tasks as a series of 2D tasks. K-fold cross-validation on the training set was performed by only 37% of the participants, and only 50% of the participants performed ensembling, based on multiple identical models (61%) or heterogeneous models (39%). 48% of the respondents applied postprocessing steps.
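K-fold cross-validation, reportedly used by only 37% of participants, is straightforward to implement. A minimal sketch (function name hypothetical, not from the survey) that partitions sample indices into k near-equal folds and yields train/validation index pairs:

```python
def k_fold_indices(n_samples, k):
    """Partition sample indices into k contiguous, near-equal folds and
    yield (train_indices, validation_indices) pairs, one per fold."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    for i, val in enumerate(folds):
        # training set = all indices outside the held-out fold
        train = [idx for j, f in enumerate(folds) if j != i for idx in f]
        yield train, val
```

In practice one would shuffle (and, for classification, stratify) the indices before folding; the contiguous split above is only the simplest variant.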
Various methods have been developed within the ensemble and consensus clustering literature to combine inference across multiple sets of results for unsupervised clustering. The common approach of reporting results from one "best" model out of several candidate clustering models generally ignores the uncertainty that arises from model selection, and results in inferences that are sensitive to the particular model and parameters chosen and to the assumptions made, especially with small sample sizes or small cluster sizes. Bayesian model averaging (BMA) is a popular approach for combining results across multiple models that offers some attractive benefits in this setting, including a probabilistic interpretation of the combined cluster structure and quantification of model-based uncertainty. In this work we introduce ClusterBMA, a method that enables weighted model averaging across results from multiple unsupervised clustering algorithms. We use a combination of clustering internal validation criteria as a novel approximation of the posterior model probability for weighting the results from each model. From a combined posterior similarity matrix representing a weighted average of the clustering solutions across models, we apply symmetric simplex matrix factorization to calculate final probabilistic cluster allocations. This method is implemented in an accompanying R package. We explore the performance of this approach through a case study that aims to identify probabilistic clusters of individuals based on electroencephalography (EEG) data. We also use simulated data sets to explore the ability of the proposed technique to identify robust integrated clusters under varying levels of separation between subgroups and varying numbers of clusters between models.
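The core averaging step can be illustrated in a few lines. A minimal sketch, not the ClusterBMA implementation (which is in R and uses internal validation criteria to approximate posterior model probabilities): each clustering is turned into a binary co-assignment matrix, and the combined similarity matrix is their weighted average, with weights normalized to sum to one.

```python
def coassignment(labels):
    """Binary co-assignment matrix: entry (i, j) is 1 if samples i and j
    share a cluster under this labeling, else 0."""
    n = len(labels)
    return [[1.0 if labels[i] == labels[j] else 0.0 for j in range(n)]
            for i in range(n)]

def weighted_consensus(labelings, scores):
    """Weighted average of co-assignment matrices. The scores (stand-ins
    for approximate posterior model probabilities) are normalized to
    sum to one before weighting each model's solution."""
    total = sum(scores)
    weights = [s / total for s in scores]
    n = len(labelings[0])
    consensus = [[0.0] * n for _ in range(n)]
    for w, labels in zip(weights, labelings):
        c = coassignment(labels)
        for i in range(n):
            for j in range(n):
                consensus[i][j] += w * c[i][j]
    return consensus
```

Each entry of the result can then be read as a weighted probability that two samples are clustered together; the paper's subsequent factorization step into probabilistic allocations is omitted here.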
IceCube is a cubic-kilometer array of optical sensors, deployed 1.45 km to 2.45 km below the surface of the ice sheet at the South Pole, that detects atmospheric and astrophysical neutrinos between 1 GeV and 1 PeV. The classification and reconstruction of events from the in-ice detectors play a central role in IceCube data analysis. Reconstructing and classifying events is challenging due to the detector's geometry, the inhomogeneous scattering and absorption of light in the ice, and, below 100 GeV, the relatively small number of signal photons produced per event. To address this challenge, IceCube events can be represented as point-cloud graphs, with graph neural networks (GNNs) serving as the classification and reconstruction method. The GNNs are capable of distinguishing neutrino events from cosmic-ray backgrounds, classifying different neutrino event types, and reconstructing the deposited energy, direction, and interaction vertex. Based on simulation, we provide a comparison in the 1-100 GeV energy range to the current state-of-the-art maximum-likelihood techniques used in IceCube analyses, including the effects of known systematic uncertainties. For neutrino event classification, the GNN increases the signal efficiency by 18% at a fixed false positive rate (FPR), compared to current IceCube methods. Alternatively, the GNN reduces the FPR by more than a factor of 8 (to below half a percent) at a fixed signal efficiency. For the reconstruction of energy, direction, and interaction vertex, the resolution improves on average by 13%-20% compared with current maximum-likelihood techniques. When running on a GPU, the GNN is able to process IceCube events at a rate close to the median IceCube trigger rate of 2.7 kHz, which opens up the possibility of using low-energy neutrinos in online searches for transient events.
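The point-cloud-graph representation mentioned above is easy to sketch: sensor hits become nodes at their 3-D positions, and edges connect each node to its k nearest neighbours. A minimal illustration (function name hypothetical; the actual IceCube pipeline and its graph construction are not reproduced here):

```python
import math

def knn_graph(points, k):
    """Build a directed k-nearest-neighbour edge list from 3-D sensor
    positions: the kind of point-cloud graph a GNN could consume."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    edges = []
    for i, p in enumerate(points):
        # sort all other nodes by distance and keep the k closest
        neighbours = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist(p, points[j]),
        )[:k]
        edges.extend((i, j) for j in neighbours)
    return edges
```

In a real detector-scale graph one would use a spatial index rather than this O(n²) scan, and node features would carry hit attributes (e.g. charge, time) alongside position.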
Imaging markers of cerebral small vessel disease provide valuable information on brain health, but their manual assessment is time-consuming and hampered by substantial intra- and inter-rater variability. Automated rating could benefit biomedical research as well as clinical assessment, but the diagnostic reliability of existing algorithms is unknown. Here, we present the Vascular Lesions Detection and Segmentation ("Where is VALDO?") challenge, which was run as a satellite event of the international conference on Medical Image Computing and Computer Aided Intervention (MICCAI) 2021. This challenge aimed to promote the development of methods for the automated detection and segmentation of small and sparse imaging markers of cerebral small vessel disease, namely enlarged perivascular spaces (EPVS) (Task 1), cerebral microbleeds (Task 2), and lacunes of presumed vascular origin (Task 3), while leveraging weak and noisy labels. Overall, 12 teams participated in the challenge, proposing solutions for one or more tasks (4 for Task 1 - EPVS, 9 for Task 2 - microbleeds, and 6 for Task 3 - lacunes). Multi-cohort data were used for both training and evaluation. Results showed a large variability in performance both across teams and across tasks, with results notably promising for Task 1 - EPVS and Task 2 - microbleeds, and not yet practically useful for Task 3 - lacunes. The challenge also highlighted inconsistencies in performance across cases that may prevent use at the individual level, while still proving useful at the population level.
Predictive coding networks (PCNs) aim to learn a generative model of the world. Given observations, this generative model can be inverted to infer the causes of those observations. However, when training PCNs, a noticeable pathology is often observed: inference accuracy peaks and then declines with further training. This cannot be explained by overfitting, because training and test accuracy decrease simultaneously. Here we provide a thorough investigation of this phenomenon and show that it is caused by an imbalance between the speeds of the individual layers of the PCN. We demonstrate that it can be prevented by regularizing the weight matrices at each layer: by restricting the relative size of the matrices' singular values, we allow the weight matrices to change but constrain the overall influence that one layer can exert on its neighbours. We also demonstrate that a similar effect can be achieved by a simpler and more plausible scheme that only constrains the weights.
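One simple way to constrain a weight matrix's singular values, shown here as an illustrative stand-in for the paper's regularization scheme (not its exact method): estimate the largest singular value by power iteration, then rescale the matrix so that value stays below a cap, which bounds the influence one layer can exert on its neighbours.

```python
def spectral_norm(W, iters=100):
    """Estimate the largest singular value of W by power iteration
    on W^T W (W is a list of rows)."""
    n = len(W[0])
    v = [1.0] * n
    for _ in range(iters):
        # u = W v, then v = W^T u, renormalized
        u = [sum(W[i][j] * v[j] for j in range(n)) for i in range(len(W))]
        v = [sum(W[i][j] * u[i] for i in range(len(W))) for j in range(n)]
        norm = sum(x * x for x in v) ** 0.5
        v = [x / norm for x in v]
    u = [sum(W[i][j] * v[j] for j in range(n)) for i in range(len(W))]
    return sum(x * x for x in u) ** 0.5

def clip_spectral_norm(W, max_sigma):
    """Rescale W so its largest singular value does not exceed max_sigma."""
    sigma = spectral_norm(W)
    if sigma <= max_sigma:
        return W
    scale = max_sigma / sigma
    return [[w * scale for w in row] for row in W]
```

Uniform rescaling shrinks all singular values by the same factor; constraining their relative sizes, as the abstract describes, is a related but stricter intervention.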
In this paper, we present a new dynamical-systems algorithm for clustering in hyperspectral images. The main idea of the algorithm is that data points are "pushed" in the direction of increasing density, and groups of pixels that end up in the same dense region belong to the same class. This is essentially a numerical solution of the differential equation defined by the density gradient of the data points on the data manifold. The number of classes is determined automatically, and the resulting clustering can be very accurate. In addition to providing an accurate clustering, the algorithm offers a new tool for understanding hyperspectral data in high dimensions. We evaluate the algorithm on the Urban scene (available at www.tec.ary.mil/hypercube/), comparing its performance with the k-means algorithm and using pre-identified material classes as ground truth.
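The "push toward increasing density" idea is closely related to a mean-shift step under a kernel density estimate. A minimal one-step sketch (function name and parameters hypothetical; the paper's actual ODE solver and manifold treatment are not reproduced): move a point a fraction of the way toward the kernel-weighted mean of the data, which is the direction of the estimated density gradient.

```python
import math

def density_ascent_step(points, x, bandwidth=1.0, step=0.5):
    """Move point x one step along the estimated density gradient,
    using a Gaussian kernel density estimate over the data set."""
    weights = [
        math.exp(-sum((a - b) ** 2 for a, b in zip(x, p))
                 / (2 * bandwidth ** 2))
        for p in points
    ]
    total = sum(weights)
    # kernel-weighted mean of the data, per dimension
    mean = [sum(w * p[d] for w, p in zip(weights, points)) / total
            for d in range(len(x))]
    # step toward the mean (the mean-shift / density-ascent direction)
    return [a + step * (m - a) for a, m in zip(x, mean)]
```

Iterating this step drives points into dense regions; points that converge to the same region would then be assigned to the same cluster, with the number of distinct limit regions giving the number of classes.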
In safety-critical systems such as clinical diagnostics, explainable AI (XAI) is essential due to the high stakes of erroneous decisions. Currently, however, XAI resembles a loose collection of methods rather than a well-defined process. In this work, we elaborate on the conceptual similarities between the largest subgroup of XAI, interpretable machine learning (IML), and classical statistics. Based on these similarities, we propose a formalization of IML along the lines of a statistical process. Adopting this statistical view allows us to interpret machine learning models and IML methods as sophisticated statistical tools. Based on this interpretation, we infer three key questions that we consider crucial for the success and adoption of IML in safety-critical settings. By formulating these questions, we further aim to spark a discussion about what distinguishes IML from classical statistics and what our perspective implies for the future of the field.